Background

PI: Dr. Vasily Yakovlev
Contact:

The PI is interested in exploratory analysis aimed at finding a signature or predictive biomarker
that could help predict survival. The initial focus is differential expression analysis of
RNA-seq data from the TCGA database, with the aim of identifying all statistically significant differences between the two groups described below. The samples are from human
colorectal patients in the TCGA database and come from two locations: rectum and left colon.

This report presents the differential expression analysis of the colon and rectum cancer datasets. The STAR
counts were downloaded and the analysis was performed using the DESeq2 package. Finally, the results were
visualized through various plots, including a volcano plot and a heatmap.

Sample info:
* READ (rectum) cohort: 172 unique case submitter IDs
* COAD (colon) cohort: 461 unique case submitter IDs, further stratified into sub-localizations:
  * localization ‘descending colon’
  * localization ‘sigmoid colon’
  * localization ‘splenic flexure’

Analysis includes pre-processing of the TCGA data, Principal Component Analysis (PCA), Differential Gene
Expression Analysis, visualizations and a survival plot for the two cohorts of interest.

Data Preparation

Data Download

The READ (rectum) and COAD (colon) cohorts were taken from the TCGA database and stratified by
localization. The PI would like to compare the “rectum” group with the “left colon” group, which
includes the localizations ‘descending colon’, ‘sigmoid colon’, and ‘splenic flexure’. The data
consisted of gene expression quantification as STAR counts.
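
The grouping described above can be sketched as a small lookup. This is only an illustration in Python, not the report's R code; the function name `assign_group`, the tuple layout, and the exact spelling of the site labels are assumptions, and the real TCGA metadata fields may differ.

```python
# Hypothetical site labels; spellings in the actual TCGA metadata may differ.
LEFT_COLON_SITES = {"descending colon", "sigmoid colon", "splenic flexure"}


def assign_group(cohort, site):
    """Return 'rectum', 'left_colon', or None for samples outside the comparison."""
    if cohort == "READ":
        return "rectum"
    if cohort == "COAD" and site.strip().lower() in LEFT_COLON_SITES:
        return "left_colon"
    return None  # e.g. right-sided colon samples are not part of this comparison


# Toy (cohort, site) pairs, invented for illustration only.
samples = [
    ("READ", "Rectum"),
    ("COAD", "Sigmoid colon"),
    ("COAD", "Splenic flexure"),
    ("COAD", "Ascending colon"),
]
groups = [assign_group(cohort, site) for cohort, site in samples]
print(groups)
```

Samples that map to `None` would be dropped before building the count matrix for the rectum vs. left colon contrast.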

Data Preprocessing

The data were pre-processed for downstream analysis, including data
wrangling and data normalization. This step included the variance stabilizing transformation (vst)
and DESeq2 normalization to ensure data comparability across samples.
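
To make the normalization step more concrete, the sketch below illustrates the median-of-ratios idea behind DESeq2's size factors. It is a simplified Python illustration with made-up counts, not the report's code; in the actual analysis this is handled inside the DESeq2 package, and the function name `size_factors` is an assumption.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of raw counts (one value per sample).
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene in counts:
        if any(c == 0 for c in gene):
            continue  # a zero count makes the geometric mean zero, so skip the gene
        log_geo_mean = sum(math.log(c) for c in gene) / n_samples
        for j, c in enumerate(gene):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # One factor per sample: the median ratio to the per-gene geometric mean.
    return [math.exp(median(r)) for r in log_ratios]


# Toy matrix (genes x samples): sample 2 was sequenced about twice as deeply.
counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 7],
]
factors = size_factors(counts)
normalized = [[c / f for c, f in zip(gene, factors)] for gene in counts]
print(factors)
```

Dividing each sample's counts by its size factor puts the samples on a common scale, so the first three toy genes end up with equal normalized values in both samples.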

Visualization

PCA plot

Several PCA plots were generated to visualize the variation in the data. Samples that are similar
to each other cluster together in the PCA plot. PCA transforms a large set of variables into a
smaller one that still contains most of the information in the large set. It does this by identifying
the directions (principal components) along which the data vary the most. The data were first transformed (vst)
so that the variance becomes approximately independent of the mean. The axes of a PCA plot represent the principal
components. Typically, the first two principal components (PC1 and PC2) are plotted, as they capture the
most variance. A 3D PCA plot was also produced to visualize the data along the
third principal component.
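
As a rough illustration of what "the direction in which the data varies the most" means, the sketch below finds the first principal component of a tiny made-up matrix by power iteration on the covariance matrix. This is a teaching sketch in plain Python, not how the report's PCA plots were produced; the function name and the toy values are assumptions, and the input is assumed to be already transformed (for example by vst) and not constant.

```python
import math
import random


def first_principal_component(rows, iterations=200):
    """rows: samples x features. Returns (direction, scores, explained fraction)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(p)]
    centered = [[r[k] - means[k] for k in range(p)] for r in rows]
    # Sample covariance matrix (features x features).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    v = [rng.gauss(0, 1) for _ in range(p)]
    for _ in range(iterations):
        # Repeatedly multiplying by the covariance matrix pulls v toward PC1.
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(p)) for a in range(p))
    total_variance = sum(cov[k][k] for k in range(p))
    scores = [sum(c[k] * v[k] for k in range(p)) for c in centered]
    return v, scores, eigenvalue / total_variance


# Toy data: samples 1-2 resemble each other, as do samples 3-4.
rows = [
    [1.0, 2.0, 0.5],
    [1.2, 2.1, 0.4],
    [3.0, 0.5, 0.6],
    [3.1, 0.4, 0.5],
]
direction, scores, explained = first_principal_component(rows)
print(scores, explained)
```

The PC1 scores are the x-coordinates a PCA plot would show; similar samples receive similar scores, which is why they cluster together.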

Volcano Plot

A volcano plot was generated to visualize the differential expression results. This plot displays:
* Log2 Fold Change (x-axis): Indicates the magnitude of expression change between colon and rectum samples.
* -Log10 Adjusted p-value (y-axis): Represents the statistical significance of the expression change.
* Significance Thresholds: Horizontal and vertical lines on the plot indicate thresholds for significance and fold change, helping to highlight the most differentially expressed genes.

Heatmap

A heatmap was created to show the expression levels of the top differentially expressed genes. Key features
of the heatmap include:
* Expression Patterns: It visualizes the expression patterns of these genes across samples, facilitating the identification of clusters or patterns in gene expression.
* Clustering: Both genes and samples are clustered to reveal similarities and differences in expression profiles.

Code Release Statement

This R Markdown file contains code for analyzing TCGA data, including PCA,
Differential Gene Expression Analysis, and other bioinformatics visualizations. The code provided is released
under the MIT License and is intended for use in research and educational projects.

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and
associated documentation files (the “Software”), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the
following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial
portions of the Software.

Disclaimer:
The software is provided “as is”, without warranty of any kind, express or implied, including
but not limited to the warranties of merchantability, fitness for a particular purpose, and
non-infringement. In no event shall the authors or copyright holders be liable for any claim,
damages, or other liability, whether in an action of contract, tort, or otherwise, arising from, out of,
or in connection with the software or the use or other dealings in the software.

Publication Policy

As part of our commitment to transparency and scientific collaboration, our bioinformatics services core
releases code and methods upon project completion. For generated code, we maintain a private GitHub
repository during project execution where investigators and collaborating students can contribute.
Methods are written throughout the project lifecycle and are part of our core’s deliverables. Upon
publication, the repository becomes public and is released under an open-source license,
ensuring that others can build upon and benefit from our work, with the exception of code handling sensitive data.

Our core adheres to strict data security practices. Any code handling sensitive or confidential data undergoes
rigorous review to ensure compliance with privacy regulations and may or may not be publicly
released according to VCU’s data security and privacy policies.

We require that any results obtained from code generated during our collaboration include a citation to the
GitHub repository, acknowledging the contributions of our analysts. Additionally, the BISR and its source of
funding, the CCSG grant, must be included in the acknowledgment and funding sections of manuscripts.

Differential Expression Analysis


* Colon vs. Rectum
(Control level: Rectum)

Visualization

Comparison 1 Treatment: Colon vs. Control: Rectum

Comparison 2 Treatment: Descending Colon vs. Control: Rectum

Comparison 3 Treatment: Sigmoid Colon vs. Control: Rectum

Comparison 4 Treatment: Splenic Flexure of Colon vs. Control: Rectum

PCA

Expectation: We would expect to see samples that are similar to each other cluster together.

3D PCA

Comparison 1 Treatment: Colon vs. Control: Rectum

Comparison 2 Treatment: Descending Colon vs. Control: Rectum

Comparison 3 Treatment: Sigmoid Colon vs. Control: Rectum

Comparison 4 Treatment: Splenic Flexure of Colon vs. Control: Rectum

Volcano plot

Volcano: Each dot represents a gene. The log2 fold change in expression between the treatment and the control (x-axis) is plotted against -log10(padj) (y-axis). The red line marks the p-value < 0.05 threshold. Every point (gene) above that line shows a statistically significant change between the two conditions.

Vertical lines mark a 1.5-fold change. Genes highlighted in RED are up-regulated in the Treatment compared to the control. Genes highlighted in BLUE are down-regulated in the Treatment compared to the control. Genes in gray do not meet both the logFC and p-value thresholds.
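
The coloring rule described above can be sketched as follows. This Python snippet is only an illustration, not the plotting code used for the report. It assumes that "1.5 fold change" corresponds to |log2FC| > log2(1.5) and that the 0.05 cutoff is applied to the adjusted p-value; the function names and example values are hypothetical.

```python
import math

PADJ_CUTOFF = 0.05               # significance threshold stated in the report
LOG2FC_CUTOFF = math.log2(1.5)   # assumption: 1.5-fold change on the log2 scale


def classify(log2fc, padj):
    """Return 'up' (red), 'down' (blue), or 'not significant' (gray)."""
    if padj is None or padj >= PADJ_CUTOFF or abs(log2fc) <= LOG2FC_CUTOFF:
        return "not significant"
    return "up" if log2fc > 0 else "down"


def volcano_point(log2fc, padj):
    """Coordinates of a gene on the volcano plot (padj must be > 0)."""
    return (log2fc, -math.log10(padj))


# Hypothetical (log2FC, padj) pairs, invented for illustration.
examples = [(1.2, 0.001), (-0.9, 0.01), (0.3, 0.001), (2.0, 0.2)]
labels = [classify(fc, p) for fc, p in examples]
print(labels)
```

A gene must pass both thresholds to be colored; a large fold change with a non-significant adjusted p-value, or the reverse, stays gray.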

Comparison 1 Treatment: Colon vs. Control: Rectum

Comparison 2 Treatment: Descending Colon vs. Control: Rectum

Comparison 3 Treatment: Sigmoid Colon vs. Control: Rectum

Comparison 4 Treatment: Splenic Flexure of Colon vs. Control: Rectum

Heatmap

A heatmap of z-score normalized read counts for each comparison. Samples are on the x-axis and genes are on the y-axis. Red represents the number of standard deviations above the mean for each read count (i.e., higher expression), and blue represents the number of standard deviations below the mean (i.e., lower expression). White indicates that a read count is close to the mean. The dendrograms cluster the samples and the genes by RNA expression.
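
The z-score scaling behind the heatmap colors can be sketched as below. This is an illustrative Python snippet with made-up numbers, not the report's heatmap code; it assumes scaling is done per gene (row) using the sample standard deviation, and the function name `row_zscores` is hypothetical.

```python
from statistics import mean, stdev


def row_zscores(matrix):
    """Scale each gene (row) to mean 0 and standard deviation 1 across samples."""
    scaled = []
    for row in matrix:
        mu, sd = mean(row), stdev(row)
        # A gene with no variation across samples is mapped to 0 (white).
        scaled.append([(x - mu) / sd if sd > 0 else 0.0 for x in row])
    return scaled


# Toy normalized expression values (genes x samples).
expression = [
    [1.0, 2.0, 3.0],
    [5.0, 5.0, 5.0],
]
z = row_zscores(expression)
print(z)
```

Positive values would be drawn in red, negative values in blue, and values near zero in white.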

Package Citations

Package          Version    Citation
AnnotationDbi    1.66.0     @Annotat….
base             4.4.1      @base
ComplexHeatmap   2.20.0     @Complex….
DESeq2           1.44.0     @DESeq2
DT               0.33       @DT
enrichplot       1.24.2     @enrichplot
ggrepel          0.9.5      @ggrepel
gplots           3.1.3.1    @gplots
here             1.0.1      @here
janitor          2.2.0      @janitor
knitr            1.48       @knitr20….
org.Hs.eg.db     3.19.1     @orgHsegdb
pacman           0.5.1      @pacman
plotly           4.10.4     @plotly
RColorBrewer     1.1.3      @RColorB….
reticulate       1.38.0     @reticulate
rmarkdown        2.27       @rmarkdo….
RNASeqBits       0.1.0      @RNASeqBits
scales           1.3.0      @scales
TCGAbiolinks     2.32.0     @TCGAbio….
tidyverse        2.0.0      @tidyverse